Data Dictionary

Importing necessary libraries and data

Data Overview

Questions

FIRST LOOK

Initial Observations

Exploratory Data Analysis (EDA)

Questions:

  1. What does the distribution of used phone prices look like?
  2. What percentage of the used phone market is dominated by Android devices?
  3. The amount of RAM is important for the smooth functioning of a phone. How does the amount of RAM vary with the brand?
  4. A large battery often increases a phone's weight, making it feel uncomfortable in the hands. How does the weight vary for phones offering large batteries (more than 4500 mAh)?
  5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones are available across different brands with a screen size larger than 6 inches?
  6. Budget phones nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?
  7. Which attributes are highly correlated with the used phone price?
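
Question 2 comes down to a normalized value count. Here is a minimal sketch, assuming the operating system is stored in a column named `os`; the rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; the column name `os` is an assumption.
phones = pd.DataFrame({"os": ["Android", "Android", "iOS", "Android", "Others"]})

# Share of each operating system as a percentage of all listings.
os_share = phones["os"].value_counts(normalize=True).mul(100).round(1)
android_pct = os_share["Android"]

print(os_share)
```

On the real data, the same two lines would be applied to the full DataFrame instead of the toy one.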

Observation: Most used phones are priced under 500 Euros, and a few appear to cost around 2000 Euros

The top available brands are Samsung, Huawei, LG, and Lenovo

Observation: It looks like 4GB RAM phones from older years are still available, which could be useful for customers on a budget

Observation: There appears to be an outlier in the top left. How can a phone with an enormous battery and screen size weigh so little?

Observation: There is a positive relationship, but it's not as obvious because phones with 10-12MP main cameras have a very wide spread in new price. It also seems like most of the newer models in the dataset stay in the 8-12MP range

There are some lightweight phones available with large batteries, but mostly as the battery increases, so does the weight
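
The battery and weight relationship can be checked numerically as well as visually. The sketch below assumes columns named `battery` (mAh) and `weight` (grams) and uses invented values; only the 4500 mAh threshold comes from the question itself.

```python
import pandas as pd

# Made-up rows; `battery` (mAh) and `weight` (g) are assumed column names.
phones = pd.DataFrame({
    "battery": [3000, 4000, 5000, 6000, 7000],
    "weight": [140, 165, 190, 210, 320],
})

# Keep only phones with large batteries, then summarize their weight.
large = phones[phones["battery"] > 4500]
weight_summary = large["weight"].describe()

# Pearson correlation between battery capacity and weight over all phones.
corr = phones["battery"].corr(phones["weight"])

print(weight_summary)
print(f"battery/weight correlation: {corr:.2f}")
```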

Observation: There is a good amount of availability of phones from many different brands that have screen sizes greater than 6 inches
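
One way to get the per-brand count behind this observation is to filter and then count. This sketch assumes columns named `brand_name` and `screen_size`, and assumes the screen size is already in inches; if the data stores it in another unit, the threshold would need converting. The rows are invented.

```python
import pandas as pd

# Made-up rows; column names and the inch unit are assumptions.
phones = pd.DataFrame({
    "brand_name": ["Samsung", "Samsung", "Huawei", "LG", "Huawei", "Lenovo"],
    "screen_size": [6.4, 5.8, 6.5, 6.1, 6.7, 5.5],
})

# Phones with a screen larger than 6 inches, counted per brand.
big = phones[phones["screen_size"] > 6]
big_by_brand = big["brand_name"].value_counts()

print(big_by_brand)
```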

Observations

The Selfie Cameras get better year after year, for all operating systems

Observation: A lot of the newer models have been used less so they might be in better condition

Observation: The Selfie camera has gotten way better in recent years while the main camera has not

Questions

Data Preprocessing

Duplicate value check - Missing value treatment - Outlier treatment - Feature engineering - Data preparation for modeling

We already checked for duplicates and there are none. Moving on to check for missing values.
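
Both checks are one-liners in pandas. The frame below is a made-up stand-in with assumed column names, only to show the calls.

```python
import numpy as np
import pandas as pd

# Made-up rows with a few gaps; column names are assumptions.
phones = pd.DataFrame({
    "ram": [4.0, np.nan, 8.0, 4.0],
    "weight": [150.0, 160.0, np.nan, np.nan],
})

# Number of fully duplicated rows.
n_duplicates = int(phones.duplicated().sum())

# Number of missing values in each column.
missing_per_column = phones.isnull().sum()

print(n_duplicates)
print(missing_per_column)
```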

Replacing/Dropping missing values and outliers

EDA

Observation: The correlation values changed but the positive/negative signs did not

Observation: Everything looks good and we can tell that outliers were capped. We will build and test the model with and without outlier treatment in the next step
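
The exact treatment is not spelled out in this section, so the following is only a sketch of one common approach: fill missing values with the column median, then cap values outside 1.5 times the interquartile range. Both choices, the column name, and the values are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up weights with one gap and one extreme value.
s = pd.Series([100.0, 120.0, np.nan, 130.0, 110.0, 900.0], name="weight")

# Assumed imputation rule: replace missing values with the median.
filled = s.fillna(s.median())

# Assumed capping rule: clip to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = filled.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = filled.clip(lower=lower, upper=upper)

print(capped.tolist())
```

Capping keeps every row in the data, which makes it easy to compare a model trained on the capped values against one trained on the untreated values.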

Building a Linear Regression model

Model performance evaluation

Observation: The first model does really well on the testing set, with an R-squared of about 0.957
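
For reference, these are the usual definitions of R-squared, Adjusted R-squared, and RMSE, written out in NumPy. The helper name `regression_scores` and the sample numbers are hypothetical and not taken from the notebook.

```python
import numpy as np

def regression_scores(y_true, y_pred, n_features):
    """Return R-squared, Adjusted R-squared, and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size

    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation

    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    rmse = np.sqrt(ss_res / n)
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse}

# Made-up targets and predictions for a model with one feature.
scores = regression_scores([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], n_features=1)
print(scores)
```

Adjusted R-squared penalizes extra features, which is why it is the score to watch when dropping columns one at a time.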

Checking Linear Regression Assumptions

TEST FOR MULTICOLLINEARITY

There are up to 10 features with a VIF over 5. The largest is release_year. So let's drop them one by one and check the Adjusted R Squared on the training data
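
As a reminder of what the VIF measures: each feature is regressed on all the others, and VIF = 1 / (1 - R-squared) of that regression. Below is a minimal NumPy sketch of that definition; `vif_table` is a hypothetical helper, and the columns `a`, `b`, `c` are synthetic. statsmodels also provides `variance_inflation_factor` for the same purpose.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Compute the VIF of every column by regressing it on the remaining columns."""
    values = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        # Add an intercept column so the auxiliary regression is centered.
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        values[col] = 1 / (1 - r2)
    return pd.Series(values).sort_values(ascending=False)

# Synthetic data: `b` is nearly a copy of `a`, while `c` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

vifs = vif_table(X)
print(vifs)
```

In this toy example the two near-duplicate columns land far above 5, while the independent one stays close to 1.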

We don't want to remove days_used because dropping it hurt the Adj R-squared the most. Instead, we will remove battery and then weight, and check the VIFs again

Battery and Weight are related, but removing them didn't change the VIF of any of the other features, so let's check the model score on the TRAINING set by removing them both

Removing both battery and weight resulted in better scores. Moving on.

TEST FOR LINEARITY AND INDEPENDENCE

We see no pattern in the plot above. Hence, the assumptions of linearity and independence are satisfied.

TEST FOR NORMALITY

TEST FOR HOMOSCEDASTICITY

Since p-value > 0.05, we fail to reject the null hypothesis that the residuals are homoscedastic. So, this assumption is satisfied.

Final Model Summary

Including all the Coefficients

The final linear model explains about 96% of the variance in the test data

Here's what we did:

Observations

Actionable Insights and Recommendations